Abstract

We connect a broad class of generative models through their shared reliance on
sequential decision making. Motivated by this view, we develop extensions to an
existing model, and then explore the idea further in the context of data imputation,
perhaps the simplest setting in which to investigate the relation between unconditional
and conditional generative modelling. We formulate data imputation as
an MDP and develop models capable of representing effective policies for it. We
construct the models using neural networks and train them using a form of guided
policy search [?]. Our models generate predictions through an iterative process
of feedback and refinement. We show that this approach can learn effective policies
for imputation problems of varying difficulty and across multiple datasets.


1 Introduction 

Directed generative models are naturally interpreted as specifying sequential procedures for generating
data. We traditionally think of this process as sampling, but one could also view it as making
sequences of decisions for how to set the variables at each node in a model, conditioned on the 
settings of its parents, thereby generating data from the model. The large body of existing work 
on reinforcement learning provides powerful tools for addressing such sequential decision making 
problems. We encourage the use of these tools to understand and improve the extended processes 
currently driving advances in generative modelling. We show how sequential decision making can be 
applied to general prediction tasks by developing models which construct predictions by iteratively 
refining a working hypothesis under guidance from exogenous input and endogenous feedback. 

We begin this paper by reinterpreting several recent generative models as sequential decision making 
processes, and then show how changes inspired by this point of view can improve the performance 
of the LSTM-based model introduced in [4]. Next, we explore the connections between directed
generative models and reinforcement learning more fully by developing an approach to training
policies for sequential data imputation. We base our approach on formulating imputation as a
finite-horizon Markov Decision Process which one can also interpret as a deep, directed graphical model.

We propose two policy representations for the imputation MDP. One extends the model in [4] by
inserting an explicit feedback loop into the generative process, and the other addresses the MDP
more directly. We train our models/policies using techniques motivated by guided policy search
[?]. We examine their qualitative and quantitative performance across imputation problems
covering a range of difficulties (i.e. different amounts of data to impute and different “missingness
mechanisms”), and across multiple datasets. Given the relative paucity of existing approaches to the
general imputation problem, we compare our models to each other and to two simple baselines. We
also test how our policies perform when they use fewer/more steps to refine their predictions.

As imputation encompasses both classification and standard (i.e. unconditional) generative modelling,
our work suggests that further study of models for the general imputation problem is worthwhile.
The performance of our models suggests that sequential stochastic construction of predictions,
guided by both input and feedback, should prove useful for a wide range of problems. Training
these models can be challenging, but lessons from reinforcement learning may bring some relief.
2 Directed Generative Models as Sequential Decision Processes


Directed generative models have grown in popularity relative to their undirected counterparts
[?] (etc.). Reasons include: the development of efficient methods for training them,
the ease of sampling from them, and the tractability of bounds on their log-likelihoods. Growth in 
available computing power compounds these benefits. One can interpret the (ancestral) sampling 
process in a directed model as repeatedly setting subsets of the latent variables to particular values, 
in a sequence of decisions conditioned on preceding decisions. Each subsequent decision restricts 
the set of potential outcomes for the overall sequence. Intuitively, these models encode stochastic 
procedures for constructing plausible observations. This section formally explores this perspective. 

2.1 Deep AutoRegressive Networks 

The deep autoregressive networks investigated in [?] define distributions of the following form:

$$p(x) = \sum_{z} p(x \mid z)\, p(z), \quad \text{with} \quad p(z) = p_0(z_0) \prod_{t=1}^{T} p_t(z_t \mid z_0, \ldots, z_{t-1}) \tag{1}$$

in which $x$ indicates a generated observation and $z_0, \ldots, z_T$ represent latent variables in the model.
The distribution $p(x \mid z)$ may be factored similarly to $p(z)$. The form of $p(z)$ in Eqn. 1 can represent
arbitrary distributions over the latent variables, and the work in [?] mainly concerned approaches
to parameterizing the conditionals $p_t(z_t \mid z_0, \ldots, z_{t-1})$ that restricted representational power in
exchange for computational tractability. To appreciate the generality of Eqn. 1, consider using $z_t$ that
are univariate, multivariate, structured, etc. One can interpret any model based on this sequential
factorization of $p(z)$ as a non-stationary policy $p_t(z_t \mid s_t)$ for selecting each action $z_t$ in a state $s_t$,
with each $s_t$ determined by all $z_{t'}$ for $t' < t$, and train it using some form of policy search.
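The policy view of Eqn. 1 can be made concrete with a toy example. The following Python sketch is illustrative only: the binary latent variables and the conditional rule in `step_prob` are assumptions chosen for brevity, and are not taken from any of the cited models. The state is the tuple of earlier decisions, and each step selects one action given that state.

```python
import itertools
import math
import random


def step_prob(state):
    """Hypothetical policy p_t(z_t = 1 | s_t); s_t is the tuple of earlier actions."""
    return (1 + sum(state)) / (2 + len(state))


def log_prob(trajectory):
    """log p(z) under the sequential factorization, one conditional per step."""
    total = 0.0
    for t, z in enumerate(trajectory):
        p_one = step_prob(trajectory[:t])
        total += math.log(p_one if z == 1 else 1.0 - p_one)
    return total


def rollout(num_steps, rng):
    """Ancestral sampling viewed as a policy rollout: act, then extend the state."""
    state = ()
    for _ in range(num_steps):
        z = 1 if rng.random() < step_prob(state) else 0
        state = state + (z,)
    return state


trajectory = rollout(5, random.Random(0))
# Summing p(z) over all 2^5 trajectories should give total probability mass 1.
total_mass = sum(math.exp(log_prob(z)) for z in itertools.product((0, 1), repeat=5))
```

Because every conditional is normalized, the trajectory probabilities sum to one no matter how the per-step rule is chosen, which is what makes the factorization fully general.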


2.2 Generalized Guided Policy Search 


We adopt a broader interpretation of guided policy search than one might initially take from, e.g.,
[?]. We provide a review of guided policy search in the supplementary material. Our
expanded definition of guided policy search includes any optimization of the general form:


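$$\operatorname*{minimize}_{p,\,q} \; \mathbb{E}_{i_q \sim I_q} \, \mathbb{E}_{i_p \sim I_p(\cdot \mid i_q)} \left[ \mathbb{E}_{\tau \sim q(\tau \mid i_q, i_p)} \left[ \ell(\tau, i_q, i_p) \right] + \lambda \, \mathrm{div}\left( q(\tau \mid i_q, i_p),\, p(\tau \mid i_p) \right) \right] \tag{2}$$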


in which $p$ indicates the primary policy, $q$ indicates the guide policy, $I_q$ indicates a distribution over
information available only to $q$, $I_p$ indicates a distribution over information available to both $p$ and
$q$, $\ell(\tau, i_q, i_p)$ computes the cost of trajectory $\tau$ in the context of $i_q$/$i_p$, and $\mathrm{div}(q(\tau \mid i_q, i_p), p(\tau \mid i_p))$
measures dissimilarity between the trajectory distributions generated by $p$/$q$. As $\lambda > 0$ goes to
infinity, Eqn. 2 enforces the constraint $p(\tau \mid i_p) = q(\tau \mid i_q, i_p)$, $\forall \tau, i_p, i_q$. Terms for controlling, e.g.,
the entropy of $p$/$q$ can also be added. The power of the objective in Eqn. 2 stems from two main
points: the guide policy $q$ can use information $i_q$ that is unavailable to the primary policy $p$, and the
primary policy need only be trained to minimize the dissimilarity term $\mathrm{div}(q(\tau \mid i_q, i_p), p(\tau \mid i_p))$.
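As a rough illustration of how the two terms in Eqn. 2 trade off, the sketch below evaluates the objective for one fixed context $(i_q, i_p)$. It assumes a tiny discrete set of trajectories and takes the dissimilarity measure to be the KL divergence, which is only one possible choice; the trajectory names, costs, and probabilities are made up for the example.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for distributions stored as {trajectory: probability} dicts."""
    return sum(q[t] * math.log(q[t] / p[t]) for t in q if q[t] > 0.0)


def gps_objective(q, p, cost, lam):
    """Expected cost under the guide q plus lam times the dissimilarity to p."""
    expected_cost = sum(q[t] * cost[t] for t in q)
    return expected_cost + lam * kl_divergence(q, p)


# Hypothetical two-trajectory problem for one fixed context (i_q, i_p).
cost = {"good": 0.0, "bad": 1.0}
primary = {"good": 0.5, "bad": 0.5}
guide = {"good": 0.9, "bad": 0.1}

loose = gps_objective(guide, primary, cost, lam=0.1)
tight = gps_objective(guide, primary, cost, lam=10.0)
matched = gps_objective(primary, primary, cost, lam=10.0)
```

With a small weight the guide is free to concentrate on low-cost trajectories, while a large weight makes any mismatch with the primary policy expensive, so the best guide approaches the primary policy.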


For example, a directed model structured as in Eqn. 1 can be interpreted as specifying a policy for
a finite-horizon MDP whose terminal state distribution encodes $p(x)$. In this MDP, the state at time
$1 \le t \le T+1$ is determined by $\{z_0, \ldots, z_{t-1}\}$. The policy picks an action $z_t \in \mathcal{Z}_t$ at time $1 \le t \le T$,
and picks an action $x \in \mathcal{X}$ at time $t = T+1$. I.e., the policy can be written as $p_t(z_t \mid z_0, \ldots, z_{t-1})$
for $1 \le t \le T$, and as $p(x \mid z_0, \ldots, z_T)$ for $t = T+1$. The initial state $z_0 \in \mathcal{Z}_0$ is drawn from $p_0(z_0)$.
Executing the policy for a single trial produces a trajectory $\tau = \{z_0, \ldots, z_T, x\}$, and the distribution
over $x$s from these trajectories is just $p(x)$ in the corresponding directed generative model.


The authors of [?] train deep autoregressive networks by maximizing a variational lower bound on
the training set log-likelihood. To do this, they introduce a variational distribution $q$ which provides
$q_0(z_0 \mid x^*)$ and $q_t(z_t \mid z_0, \ldots, z_{t-1}, x^*)$ for $1 \le t \le T$, with the final step $q(x \mid z_0, \ldots, z_T, x^*)$ given by
a Dirac-delta at $x^*$. Given these definitions, the training in [?] can be interpreted as guided policy
search for the MDP described in the previous paragraph. Specifically, the variational distribution $q$
provides a guide policy $q(\tau \mid x^*)$ over trajectories $\tau = \{z_0, \ldots, z_T, x^*\}$:

$$q(\tau \mid x^*) = q(x \mid z_0, \ldots, z_T, x^*)\, q_0(z_0 \mid x^*) \prod_{t=1}^{T} q_t(z_t \mid z_0, \ldots, z_{t-1}, x^*) \tag{3}$$
The primary policy $p$ generates trajectories distributed according to:

$$p(\tau) = p(x \mid z_0, \ldots, z_T)\, p_0(z_0) \prod_{t=1}^{T} p_t(z_t \mid z_0, \ldots, z_{t-1}) \tag{4}$$


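which does not depend on $x^*$. In this case, $x^*$ corresponds to the guide-only information $i_q \sim I_q$ in
Eqn. 2. We now rewrite the variational optimization as:

$$\operatorname*{minimize}_{p,\,q} \; \mathbb{E}_{x^* \sim \mathcal{D}_X} \left[ \mathbb{E}_{\tau \sim q(\tau \mid x^*)} \left[ \ell(\tau, x^*) \right] + \mathrm{KL}\left( q(\tau \mid x^*) \,\|\, p(\tau) \right) \right] \tag{5}$$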


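where $\ell(\tau, x^*) = 0$ and $\mathcal{D}_X$ indicates the target distribution for the terminal state of the primary
policy $p$.¹ When expanded, the KL term in Eqn. 5 becomes:

$$\mathrm{KL}\left( q(\tau \mid x^*) \,\|\, p(\tau) \right) = \mathbb{E}_{\tau \sim q(\tau \mid x^*)} \left[ \log \frac{q_0(z_0 \mid x^*)}{p_0(z_0)} + \sum_{t=1}^{T} \log \frac{q_t(z_t \mid z_0, \ldots, z_{t-1}, x^*)}{p_t(z_t \mid z_0, \ldots, z_{t-1})} - \log p(x^* \mid z_0, \ldots, z_T) \right] \tag{6}$$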


Thus, the variational approach used in [?] for training directed generative models can be interpreted
as a form of generalized guided policy search. As the form in Eqn. 1 can represent any finite directed
generative model, the preceding derivation extends to all models we discuss in this paper.²
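The path-wise KL in Eqn. 6 can be checked numerically in the smallest possible case. The sketch below assumes a hypothetical model with a single binary latent variable $z_0$ (so the sum over $t$ is empty) and one fixed observation $x^*$; the probabilities are invented for the example. It evaluates the path-wise KL for a few guide distributions and compares it with $-\log p(x^*)$.

```python
import math

# Hypothetical one-step model: z_0 in {0, 1}, with a single fixed observation x*.
prior = {0: 0.7, 1: 0.3}        # p_0(z_0)
likelihood = {0: 0.2, 1: 0.9}   # p(x* | z_0)


def pathwise_kl(guide):
    """E_q[ log q_0(z_0 | x*) / p_0(z_0) - log p(x* | z_0) ], i.e. Eqn. 6 with T = 0."""
    return sum(
        guide[z] * (math.log(guide[z] / prior[z]) - math.log(likelihood[z]))
        for z in guide
        if guide[z] > 0.0
    )


evidence = sum(prior[z] * likelihood[z] for z in prior)              # p(x*)
posterior = {z: prior[z] * likelihood[z] / evidence for z in prior}  # p(z_0 | x*)
neg_log_px = -math.log(evidence)

bound_with_prior = pathwise_kl(prior)
bound_with_uniform = pathwise_kl({0: 0.5, 1: 0.5})
bound_with_posterior = pathwise_kl(posterior)
```

In this toy case the path-wise KL never falls below $-\log p(x^*)$, and it matches that value when the guide equals the exact posterior, which is the usual behaviour of a variational bound.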


2.3 Time-reversible Stochastic Processes 


One can simplify Eqn. 1 by assuming suitable forms for $\mathcal{X}$ and $\mathcal{Z}_0, \ldots, \mathcal{Z}_T$. E.g., the authors of [18]
proposed a model in which $\mathcal{Z}_t = \mathcal{X}$ for all $t$ and $p_0(x_0)$ was Gaussian. We can write their model as:

$$p(x_T) = \sum_{x_0, \ldots, x_{T-1}} p_T(x_T \mid x_{T-1})\, p_0(x_0) \prod_{t=1}^{T-1} p_t(x_t \mid x_{t-1}) \tag{7}$$

where $p(x_T)$ indicates the terminal state distribution of the non-stationary, finite-horizon Markov
process determined by $\{p_0(x_0), p_1(x_1 \mid x_0), \ldots, p_T(x_T \mid x_{T-1})\}$. Note that, throughout this paper, we
(ab)use sums over latent variables and trajectories which could/should be written as integrals.
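For a small discrete state space, the sum over trajectories in Eqn. 7 can be computed directly. The sketch below uses a hypothetical two-state chain with made-up transition tables (the models discussed here use continuous states). It computes the terminal distribution once by enumerating every trajectory and once by pushing the state distribution forward step by step.

```python
import itertools


def terminal_by_enumeration(p0, transitions):
    """Sum the probability of every trajectory (x_0, ..., x_T), grouped by x_T."""
    num_states = len(p0)
    terminal = [0.0] * num_states
    for traj in itertools.product(range(num_states), repeat=len(transitions) + 1):
        prob = p0[traj[0]]
        for t, step in enumerate(transitions):
            prob *= step[traj[t]][traj[t + 1]]
        terminal[traj[-1]] += prob
    return terminal


def terminal_by_propagation(p0, transitions):
    """Push the state distribution forward one transition at a time."""
    dist = list(p0)
    for step in transitions:
        dist = [sum(dist[i] * step[i][j] for i in range(len(dist)))
                for j in range(len(step[0]))]
    return dist


# Hypothetical two-state chain with T = 3 non-stationary transitions,
# where transitions[t - 1][i][j] = p_t(x_t = j | x_{t-1} = i).
p0 = [0.5, 0.5]
transitions = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.5, 0.5], [0.1, 0.9]],
]
by_sum = terminal_by_enumeration(p0, transitions)
by_push = terminal_by_propagation(p0, transitions)
```

Enumeration grows exponentially with $T$, while propagation is linear in $T$; both give the same terminal distribution because the process is Markov.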


The authors of [18] observed that, for any reasonably smooth target distribution $\mathcal{D}_X$ and sufficiently
large $T$, one can define a “reverse-time” stochastic process $q_t(x_{t-1} \mid x_t)$ with simple, time-invariant
dynamics that transforms $q(x_T) \triangleq \mathcal{D}_X$ into the Gaussian distribution $p_0(x_0)$. This $q$ is given by:

$$q_0(x_0) = \sum_{x_1, \ldots, x_T} q_1(x_0 \mid x_1)\, \mathcal{D}_X(x_T) \prod_{t=2}^{T} q_t(x_{t-1} \mid x_t) \approx p_0(x_0) \tag{8}$$
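To give a feel for why simple reverse-time dynamics can carry data towards a Gaussian, the sketch below tracks only the mean and variance of a scalar $x$ under an assumed kernel $x_{t-1} = \sqrt{1-\beta}\, x_t + \sqrt{\beta}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. This particular kernel and the value of $\beta$ are assumptions made for the illustration; the text above only requires the dynamics to be simple and time-invariant.

```python
import math


def reverse_time_moments(mean, variance, beta, num_steps):
    """Mean and variance of x_0 after running the noising kernel for num_steps steps."""
    for _ in range(num_steps):
        mean = math.sqrt(1.0 - beta) * mean
        variance = (1.0 - beta) * variance + beta
    return mean, variance


# Start far from N(0, 1): a hypothetical data distribution with mean 3 and variance 0.01.
mean_0, var_0 = reverse_time_moments(3.0, 0.01, beta=0.05, num_steps=400)
```

The mean decays geometrically to zero and the variance converges to one, so the first two moments of $q_0(x_0)$ approach those of a standard Gaussian regardless of where the process starts.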


Next, we define $q(\tau)$ as the distribution over trajectories $\tau = \{x_0, \ldots, x_T\}$ generated by the reverse-time
process determined by $\{q_1(x_0 \mid x_1), \ldots, q_T(x_{T-1} \mid x_T), \mathcal{D}_X(x_T)\}$. We define $p(\tau)$ as the distribution
over trajectories generated by the “forward-time” process in Eqn. 7. The training in [18] is
equivalent to guided policy search using guide trajectories sampled from $q$, i.e. it uses the objective:

$$\operatorname*{minimize}_{p,\,q} \; \mathbb{E}_{\tau \sim q(\tau)} \left[ \log \frac{q_1(x_0 \mid x_1)}{p_0(x_0)} + \sum_{t=1}^{T-1} \log \frac{q_{t+1}(x_t \mid x_{t+1})}{p_t(x_t \mid x_{t-1})} + \log \frac{\mathcal{D}_X(x_T)}{p_T(x_T \mid x_{T-1})} \right] \tag{9}$$

which corresponds to minimizing $\mathrm{KL}(q \,\|\, p)$. If the log-densities in Eqn. 9 are tractable, then this
minimization can be done using basic Monte-Carlo. If, as in [18], the reverse-time process $q$ is not
trained, then Eqn. 9 simplifies to: $\operatorname*{minimize}_{p} \; \mathbb{E}_{q(\tau)} \left[ -\log p_0(x_0) - \sum_{t=1}^{T} \log p_t(x_t \mid x_{t-1}) \right]$.

This trick for generating guide trajectories exhibiting a particular distribution over terminal states
$x_T$ (i.e. running dynamics backwards in time starting from $x_T \sim \mathcal{D}_X$) may prove useful in settings
other than those considered in [18]. E.g., the LapGAN model in [?] learns to approximately invert
a fixed (and information destroying) reverse-time process. The supplementary material expands on
the content of this subsection, including a derivation of Eqn. 9 as a bound on $\mathbb{E}_{x \sim \mathcal{D}_X}[-\log p(x)]$.


¹We could pull the $-\log p(x^* \mid z_0, \ldots, z_T)$ term from the KL and put it in the cost $\ell(\tau, x^*)$, but we prefer the
“path-wise KL” formulation for its elegance. We abuse notation using $\mathrm{KL}(\delta(x = x^*) \,\|\, p(x)) = -\log p(x^*)$.
²This also includes all generative models implemented and executed on an actual computer.
2.4 Learning Generative Stochastic Processes with LSTMs


The authors of [4] introduced a model for sequentially-deep generative processes. We interpret their
model as a primary policy $p$ which generates trajectories $\tau = \{z_0, \ldots, z_T, x\}$ with distribution:

$$p(\tau) = p(x \mid s_\theta(\tau_{<x}))\, p_0(z_0) \prod_{t=1}^{T} p_t(z_t), \quad \text{with} \quad \tau_{<x} = \{z_0, \ldots, z_T\} \tag{10}$$

in which $\tau_{<x}$ indicates a latent trajectory and $s_\theta(\tau_{<x})$ indicates a state trajectory $\{s_0, \ldots, s_T\}$
computed recursively from $\tau_{<x}$ using the update $s_t \leftarrow f_\theta(s_{t-1}, z_t)$ for $t \ge 1$. The initial state $s_0$ is
given by a trainable constant. Each state $s_t = [h_t; v_t]$ represents the joint hidden/visible state $h_t$/$v_t$
of an LSTM and $f_\theta(\text{state}, \text{input})$ computes a standard LSTM update.³ The authors of [4] defined
all $p_t(z_t)$ as isotropic Gaussians and defined the output distribution $p(x \mid s_\theta(\tau_{<x}))$ as $p(x \mid c_T)$, where
$c_T \triangleq c_0 + \sum_{t=1}^{T} \omega_\theta(v_t)$. Here, $c_0$ is a trainable constant and $\omega_\theta(v_t)$ is, e.g., an affine transform of
$v_t$. Intuitively, $\omega_\theta(v_t)$ transforms $v_t$ into a refinement of the “working hypothesis” $c_{t-1}$, which gets
updated to $c_t = c_{t-1} + \omega_\theta(v_t)$. $p$ is governed by parameters $\theta$ which affect $f_\theta$, $\omega_\theta$, $s_0$, and $c_0$. The
supplementary material provides pseudo-code and an illustration for this model.


To train $p$, the authors of [4] introduced a guide policy $q$ with trajectory distribution:

$$q(\tau \mid x^*) = q(x \mid s_\phi(\tau_{<x}), x^*)\, q_0(z_0 \mid x^*) \prod_{t=1}^{T} q_t(z_t \mid \tilde{s}_t, x^*), \quad \text{with} \quad \tau_{<x} = \{z_0, \ldots, z_T\} \tag{11}$$

in which $s_\phi(\tau_{<x})$ indicates a state trajectory $\{\tilde{s}_0, \ldots, \tilde{s}_T\}$ computed recursively from $\tau_{<x}$ using the
guide policy’s state update $\tilde{s}_t \leftarrow f_\phi(\tilde{s}_{t-1}, g_\phi(s_\theta(\tau_{<t}), x^*))$. In this update $\tilde{s}_{t-1}$ is the previous guide
state and $g_\phi(s_\theta(\tau_{<t}), x^*)$ is a deterministic function of $x^*$ and the partial (primary) state trajectory
$s_\theta(\tau_{<t}) \triangleq \{s_0, \ldots, s_{t-1}\}$, which is computed recursively from $\tau_{<t} = \{z_0, \ldots, z_{t-1}\}$ using the state
update $s_t \leftarrow f_\theta(s_{t-1}, z_t)$. The output distribution $q(x \mid s_\phi(\tau_{<x}), x^*)$ is defined as a Dirac-delta at
$x^*$.⁴ Each $q_t(z_t \mid \tilde{s}_t, x^*)$ is a diagonal Gaussian distribution with means and log-variances given by
an affine function $L_\phi(\tilde{v}_t)$ of $\tilde{v}_t$. $q_0(z_0)$ is defined as identical to $p_0(z_0)$. $q$ is governed by parameters
$\phi$ which affect the state updates $f_\phi(\tilde{s}_{t-1}, g_\phi(s_\theta(\tau_{<t}), x^*))$ and the step distributions $q_t(z_t \mid \tilde{s}_t, x^*)$.
$g_\phi(s_\theta(\tau_{<t}), x^*)$ corresponds to the “read” operation of the encoder network in [4].


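Using our definitions for $p$/$q$, the training objective in [4] is given by:

$$\operatorname*{minimize}_{p,\,q} \; \mathbb{E}_{x^* \sim \mathcal{D}_X} \, \mathbb{E}_{\tau \sim q(\tau \mid x^*)} \left[ \sum_{t=1}^{T} \log \frac{q_t(z_t \mid \tilde{s}_t, x^*)}{p_t(z_t)} - \log p(x^* \mid s_\theta(\tau_{<x})) \right] \tag{12}$$

which can be written more succinctly as $\mathbb{E}_{x^* \sim \mathcal{D}_X} \mathrm{KL}(q(\tau \mid x^*) \,\|\, p(\tau))$. This objective upper-bounds
$\mathbb{E}_{x^* \sim \mathcal{D}_X}[-\log p(x^*)]$, where $p(x) = \sum_{\tau_{<x}} p(x \mid s_\theta(\tau_{<x}))\, p(\tau_{<x})$.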
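Because each $p_t(z_t)$ is an isotropic Gaussian and each $q_t$ is a diagonal Gaussian, the expectation of a per-step log-ratio in Eqn. 12 under $q_t$ (for a fixed guide state) is a KL divergence with a closed form. The sketch below assumes $p_t = \mathcal{N}(0, I)$; the unit variance is an assumption, since the text only states that $p_t$ is isotropic. The function name is hypothetical.

```python
import math


def gaussian_step_kl(means, log_variances):
    """KL( N(means, diag(exp(log_variances))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        mu * mu + math.exp(log_var) - 1.0 - log_var
        for mu, log_var in zip(means, log_variances)
    )


same = gaussian_step_kl([0.0, 0.0], [0.0, 0.0])
shifted = gaussian_step_kl([1.0, 0.0], [0.0, 0.0])
wide = gaussian_step_kl([0.0], [math.log(4.0)])
```

Parameterizing by log-variances, as the step distributions above do, keeps the variance positive without any constraint on the affine outputs.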


2.5 Extending the LSTM-based Generative Model 


We propose changing $p$ in Eqn. 10 to: $p(\tau) = p(x \mid s_\theta(\tau_{<x}))\, p_0(z_0) \prod_{t=1}^{T} p_t(z_t \mid s_{t-1})$. We define
$p_t(z_t \mid s_{t-1})$ as a diagonal Gaussian distribution with means and log-variances given by an affine
function $L_\theta(v_{t-1})$ of $v_{t-1}$ (remember that $s_t = [h_t; v_t]$), and we define $p_0(z_0)$ as an isotropic
Gaussian. We set $s_0$ using $s_0 \leftarrow f_\theta(z_0)$, where $f_\theta$ is a trainable function (e.g. a neural network).
Intuitively, our changes make the model more like a typical policy by conditioning its “action” $z_t$ on
its state $s_{t-1}$, and upgrade the model to an infinite mixture by placing a distribution over its initial
state $s_0$. We also consider using $c_t = L_\theta(h_t)$, which transforms the hidden part of the LSTM state $s_t$
directly into an observation. This makes $h_t$ a working memory in which to construct an observation.
The supplementary material provides pseudo-code and an illustration for this model.


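We train this model by optimizing the objective:

$$\operatorname*{minimize}_{p,\,q} \; \mathbb{E}_{x^* \sim \mathcal{D}_X} \, \mathbb{E}_{\tau \sim q(\tau \mid x^*)} \left[ \log \frac{q_0(z_0 \mid x^*)}{p_0(z_0)} + \sum_{t=1}^{T} \log \frac{q_t(z_t \mid \tilde{s}_t, x^*)}{p_t(z_t \mid s_{t-1})} - \log p(x^* \mid s_\theta(\tau_{<x})) \right] \tag{13}$$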


³For those unfamiliar with LSTMs, a good introduction can be found in [?]. We use LSTMs including input
gates, forget gates, output gates, and peephole connections for all tests presented in this paper.

⁴It may be useful to relax this assumption.
where we now have to deal with $p_t(z_t \mid s_{t-1})$, $p_0(z_0)$, and $q_0(z_0 \mid x^*)$, which could be treated as
constants in the model from [4]. We define $q_0(z_0 \mid x^*)$ as a diagonal Gaussian distribution whose
means and log-variances are given by a trainable function $g_\phi(x^*)$.


When trained for the binarized MNIST benchmark used in [?], our extended model scored a negative
log-likelihood of 85.5 on the test set.⁵ For comparison, the score reported in [4] was 87.40.⁶ After
fine-tuning the variational distribution (i.e. $q$) on the test set, our model’s score improved to 84.8, which is
quite strong considering it is an upper bound. For comparison, see the best upper bound reported for
this benchmark in [?], which was 85.1. When the model used the alternate $c_T = L_\theta(h_T)$, the raw/fine-tuned
test scores were 85.9/85.3. Fig. 1 shows samples from the model. Model/test code is available at
http://github.com/Philip-Bachman/Sequential-Generation

Figure 1: The left block shows $\sigma(c_t)$ for $t \in \{1, 3, 5, 9, 16\}$, for a policy $p$ with $c_t = c_0 + \sum_{t'=1}^{t} L_\theta(v_{t'})$.
The right block is analogous, for a model using $c_t = L_\theta(h_t)$.


3 Developing Models for Sequential Imputation 

The goal of imputation is to estimate $p(x^u \mid x^k)$, where $x = [x^u; x^k]$ indicates a complete observation
with known values $x^k$ and missing values $x^u$. We define a mask $m \in \mathcal{M}$ as a (disjoint) partition of
$x$ into $x^u$/$x^k$. By expanding $x^u$ to include all of $x$, one recovers standard generative modelling. By
shrinking $x^u$ to include a single element of $x$, one recovers standard classification/regression. Given
distribution $\mathcal{D}_M$ over $m \in \mathcal{M}$ and distribution $\mathcal{D}_X$ over $x \in \mathcal{X}$, the objective for imputation is:

$$\operatorname*{minimize}_{p} \; \mathbb{E}_{x \sim \mathcal{D}_X} \, \mathbb{E}_{m \sim \mathcal{D}_M} \left[ -\log p(x^u \mid x^k) \right] \tag{14}$$


We now describe a finite-horizon MDP for which guided policy search minimizes a bound on the
objective in Eqn. 14. The MDP is defined by mask distribution $\mathcal{D}_M$, complete observation distribution
$\mathcal{D}_X$, and the state spaces $\{\mathcal{Z}_0, \ldots, \mathcal{Z}_T\}$ associated with each of $T$ steps. Together, $\mathcal{D}_M$ and
$\mathcal{D}_X$ define a joint distribution over initial states and rewards in the MDP. For the trial determined
by $x \sim \mathcal{D}_X$ and $m \sim \mathcal{D}_M$, the initial state $z_0 \sim p(z_0 \mid x^k)$ is selected by the policy $p$ based on the
known values $x^k$. The cost $\ell(\tau, x^u, x^k)$ suffered by trajectory $\tau = \{z_0, \ldots, z_T\}$ in the context $(x, m)$
is given by $-\log p(x^u \mid \tau, x^k)$, i.e. the negative log-likelihood of $p$ guessing the missing values $x^u$
after following trajectory $\tau$, while seeing the known values $x^k$.


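We consider a policy $p$ with trajectory distribution $p(\tau \mid x^k) = p(z_0 \mid x^k) \prod_{t=1}^{T} p(z_t \mid z_0, \ldots, z_{t-1}, x^k)$,
where $x^k$ is determined by $x$/$m$ for the current trial and $p$ can’t observe the missing values $x^u$. With
these definitions, we can find an approximately optimal imputation policy by solving:

$$\operatorname*{minimize}_{p} \; \mathbb{E}_{x \sim \mathcal{D}_X} \, \mathbb{E}_{m \sim \mathcal{D}_M} \, \mathbb{E}_{\tau \sim p(\tau \mid x^k)} \left[ -\log p(x^u \mid \tau, x^k) \right] \tag{15}$$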


I.e. the expected negative log-likelihood of making a correct imputation on any given trial. This is a
valid, but loose, upper bound on the imputation objective in Eqn. 14 (from Jensen’s inequality). We
can tighten the bound by introducing a guide policy (i.e. a variational distribution).
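The gap introduced by Jensen’s inequality is easy to see numerically. The sketch below considers a single trial with three hypothetical trajectories; the trajectory probabilities and likelihoods are made up. It compares the expected negative log-likelihood (the per-trial quantity in Eqn. 15) with the negative log of the marginal likelihood (the per-trial quantity in Eqn. 14).

```python
import math


def expected_nll(weights, likelihoods):
    """E_tau[ -log p(x^u | tau, x^k) ], the per-trial quantity inside Eqn. 15."""
    return sum(w * -math.log(l) for w, l in zip(weights, likelihoods))


def marginal_nll(weights, likelihoods):
    """-log E_tau[ p(x^u | tau, x^k) ] = -log p(x^u | x^k), as in Eqn. 14."""
    return -math.log(sum(w * l for w, l in zip(weights, likelihoods)))


# Hypothetical trial with three possible trajectories.
weights = [0.2, 0.5, 0.3]          # p(tau | x^k)
likelihoods = [0.9, 0.1, 0.4]      # p(x^u | tau, x^k)
gap = expected_nll(weights, likelihoods) - marginal_nll(weights, likelihoods)

# The gap closes when every trajectory explains the missing values equally well.
flat_gap = expected_nll(weights, [0.4] * 3) - marginal_nll(weights, [0.4] * 3)
```

The bound is loose exactly when different trajectories assign very different likelihoods to the missing values, which is the situation a guide policy helps with by steering towards trajectories that explain $x^u$ well.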

As with the unconditional generative models in Sec. 2, we train $p$ to imitate a guide policy $q$ shaped
by additional information (here it’s $x^u$). This $q$ generates trajectories with distribution
$q(\tau \mid x^u, x^k) = q(z_0 \mid x^u, x^k) \prod_{t=1}^{T} q(z_t \mid z_0, \ldots, z_{t-1}, x^u, x^k)$. Given this $p$ and $q$, guided policy search solves:

$$\operatorname*{minimize}_{p,\,q} \; \mathbb{E}_{x \sim \mathcal{D}_X} \, \mathbb{E}_{m \sim \mathcal{D}_M} \left[ \mathbb{E}_{\tau \sim q(\tau \mid i_q, i_p)} \left[ -\log q(x^u \mid \tau, i_q, i_p) \right] + \mathrm{KL}\left( q(\tau \mid i_q, i_p) \,\|\, p(\tau \mid i_p) \right) \right] \tag{16}$$

where we define $i_q = x^u$, $i_p = x^k$, and $q(x^u \mid \tau, i_q, i_p) \triangleq p(x^u \mid \tau, i_p)$.

⁵Data splits from: http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist

⁶The model in [4] significantly improves its score to 80.97 when using an image-specific architecture.
3.1 A Direct Representation for Sequential Imputation Policies


We define an imputation trajectory as $c_\tau = \{c_0, \ldots, c_T\}$, where each partial imputation $c_t \in \mathcal{X}$ is
computed from a partial step trajectory $\tau_{<t} = \{z_1, \ldots, z_t\}$. A partial imputation $c_{t-1}$ encodes the
policy’s guess for the missing values $x^u$ immediately prior to selecting step $z_t$, and $c_T$ gives the policy’s
final guess. At each step of iterative refinement, the policy selects a $z_t$ based on $c_{t-1}$ and the
known values $x^k$, and then updates its guesses to $c_t$ based on $c_{t-1}$ and $z_t$. By iteratively refining its
guesses based on feedback from earlier guesses and the known values, the policy can construct complexly
structured distributions over its final guess $c_T$ after just a few steps. This happens naturally,
without any post-hoc MRFs/CRFs (as in many approaches to structured prediction), and without
sampling values in $c_T$ one at a time (as required by existing NADE-type models [9]). This property
of our approach should prove useful for many tasks.

We consider two ways of updating the guesses in $c_t$, mirroring those described in Sec. 2. The first
way sets $c_t \leftarrow c_{t-1} + \omega_\theta(z_t)$, where $\omega_\theta(z_t)$ is a trainable function. We set $c_0 = [c_0^u; c_0^k]$ using a
trainable bias. The second way sets $c_t \leftarrow \omega_\theta(z_t)$. We indicate models using the first type of update
with the suffix -add, and models using the second type of update with -jump. Our primary policy $p_\theta$
selects $z_t$ at each step $1 \le t \le T$ using $p_\theta(z_t \mid c_{t-1}, x^k)$, which we restrict to be a diagonal Gaussian.
This is a simple, stationary policy. Together, the step selector $p_\theta(z_t \mid c_{t-1}, x^k)$ and the imputation
constructor $\omega_\theta(z_t)$ fully determine the behaviour of the primary policy. The supplementary material
provides pseudo-code and an illustration for this model.
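The control flow of the two update rules can be sketched in a few lines. This is not the authors’ implementation: `select_step` and `omega` below are deterministic, hypothetical stand-ins for the trained step selector and imputation constructor (the real step selector samples from a diagonal Gaussian), chosen so that the behaviour is easy to trace by hand.

```python
def refine(x_known, num_steps, select_step, omega, mode="add", c0=None):
    """Run num_steps rounds of guess refinement and return [c_0, ..., c_T]."""
    guess = list(c0) if c0 is not None else [0.0] * len(x_known)
    history = [guess]
    for _ in range(num_steps):
        z = select_step(guess, x_known)  # stands in for z_t ~ p(z_t | c_{t-1}, x^k)
        update = omega(z)                # imputation constructor
        if mode == "add":
            guess = [c + u for c, u in zip(guess, update)]  # c_t = c_{t-1} + omega(z_t)
        elif mode == "jump":
            guess = list(update)                            # c_t = omega(z_t)
        else:
            raise ValueError("mode must be 'add' or 'jump'")
        history.append(guess)
    return history


# Deterministic, hypothetical stand-ins for the trained networks.
def select_step(guess, x_known):
    # Feedback: the step depends on how far the current guess is from the known values.
    return [xk - c for xk, c in zip(x_known, guess)]


def omega(z):
    return [0.5 * v for v in z]


x_known = [1.0, -1.0]
add_history = refine(x_known, 3, select_step, omega, mode="add")
jump_history = refine(x_known, 3, select_step, omega, mode="jump")
```

In both modes the step selector sees the previous guess, which is the feedback loop described above; the modes differ only in whether the new guess accumulates on top of the old one or replaces it.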


We construct a guide policy $q$ similarly to $p$. The guide policy shares the imputation constructor
$\omega_\theta(z_t)$ with the primary policy. The guide policy incorporates additional information $x = [x^u; x^k]$,
i.e. the complete observation for which the primary policy must reconstruct some missing values.
The guide policy chooses steps using $q_\phi(z_t \mid c_{t-1}, x)$, which we restrict to be a diagonal Gaussian.


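We train the primary/guide policy components $\omega_\theta$, $p_\theta$, and $q_\phi$ simultaneously on the objective:

$$\operatorname*{minimize}_{\theta,\,\phi} \; \mathbb{E}_{x \sim \mathcal{D}_X} \, \mathbb{E}_{m \sim \mathcal{D}_M} \left[ \mathbb{E}_{\tau \sim q_\phi(\tau \mid x^u, x^k)} \left[ -\log q(x^u \mid c_T^u) \right] + \mathrm{KL}\left( q(\tau \mid x^u, x^k) \,\|\, p(\tau \mid x^k) \right) \right] \tag{17}$$

where $q(x^u \mid c_T^u) = p(x^u \mid c_T^u)$. We train our models using Monte-Carlo roll-outs of $q$, and stochastic
backpropagation as in [?]. Full implementations and test code are available from
http://github.com/Philip-Bachman/Sequential-Generation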


3.2 Representing Sequential Imputation Policies using LSTMs 


To make it useful for imputation, which requires conditioning on the exogenous information $x^k$, we
modify the LSTM-based model from Sec. 2.5 to include a “read” operation in its primary policy $p$.
We incorporate a read operation by spreading $p$ over two LSTMs, $p^r$ and $p^w$, which respectively
“read” and “write” an imputation trajectory $c_\tau = \{c_0, \ldots, c_T\}$. Conveniently, the guide policy $q$
for this model takes the same form as the primary policy’s reader $p^r$. This model also includes an
“infinite mixture” initialization step, as used in Sec. 2.5, but modified to incorporate conditioning on
$x$ and $m$. The supplementary material provides pseudo-code and an illustration for this model.

Following the infinite mixture initialization step, a single full step of execution for $p$ involves several
substeps: first $p$ updates the reader state using $s_t^r \leftarrow f_\theta^r(s_{t-1}^r, \omega_\theta^r(c_{t-1}, x^k))$, then $p$ selects a
step $z_t \sim p_\theta(z_t \mid v_t^r)$, then $p$ updates the writer state using $s_t^w \leftarrow f_\theta^w(s_{t-1}^w, z_t)$, and finally $p$ updates
its guesses by setting $c_t \leftarrow c_{t-1} + \omega_\theta^w(v_t^w)$ (or $c_t \leftarrow \omega_\theta^w(h_t^w)$). In these updates, $s_t^{r,w} = [h_t^{r,w}; v_t^{r,w}]$
refer to the states of the (r)eader and (w)riter LSTMs. The LSTM updates $f_\theta^{r,w}$ and the read/write
operations $\omega_\theta^{r,w}$ are governed by the policy parameters $\theta$.

We train $p$ to imitate trajectories sampled from a guide policy $q$. The guide policy shares the primary
policy’s writer updates $f_\theta^w$ and write operation $\omega_\theta^w$, but has its own reader updates $f_\phi^q$ and read operation
$\omega_\phi^q$. At each step, the guide policy: updates the guide state $s_t^q \leftarrow f_\phi^q(s_{t-1}^q, \omega_\phi^q(c_{t-1}, x))$,
then selects $z_t \sim q_\phi(z_t \mid v_t^q)$, then updates the writer state $s_t^w \leftarrow f_\theta^w(s_{t-1}^w, z_t)$, and finally updates
its guesses $c_t \leftarrow c_{t-1} + \omega_\theta^w(v_t^w)$ (or $c_t \leftarrow \omega_\theta^w(h_t^w)$). As in Sec. 3.1, the guide policy’s read operation
$\omega_\phi^q$ gets to see the complete observation $x$, while the primary policy only gets to see the
known values $x^k$. We restrict the step distributions $p_\theta$/$q_\phi$ to be diagonal Gaussians whose means
and log-variances are affine functions of $v_t^r$/$v_t^q$. The training objective has the same form as Eqn. 17.
Figure 2: (a) Comparing the performance of our imputation models against several baselines, using
MNIST digits. The x-axis indicates the % of pixels which were dropped completely at random, and
the scores are normalized by the number of imputed pixels. (b) A closer view of results from (a),
just for our models. (c) The effect of increased iterative refinement steps for our GPSI models.


4 Experiments 


We tested the performance of our sequential imputation models on three datasets: MNIST (28x28),
SVHN (cropped, 32x32) [?], and TFD (48x48) [?]. We converted images to grayscale and
shift/scaled them to be in the range [0...1] prior to training/testing. We measured the imputation
log-likelihood $\log q(x^u \mid c_T^u)$ using the true missing values $x^u$ and the models’ guesses given by
$\sigma(c_T^u)$. We report negative log-likelihoods, so lower scores are better in all of our tests. We refer to
variants of the model from Sec. 3.1 as GPSI-add and GPSI-jump, and to variants of the model from
Sec. 3.2 as LSTM-add and LSTM-jump. Except where noted, the GPSI models used 6 refinement
steps and the LSTM models used 16.⁷
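As an illustration of how such a score could be computed, the sketch below evaluates a negative log-likelihood over the missing entries only. It assumes a Bernoulli (cross-entropy) likelihood with means $\sigma(c)$, which fits guesses of the form $\sigma(c_T^u)$ but is an assumption about the exact likelihood; the function names and the mask convention (1 = known) are also hypothetical.

```python
import math


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def imputation_nll(x, c, known_mask):
    """Negative log-likelihood of the missing entries of x under Bernoulli means sigmoid(c).

    known_mask[i] == 1 marks a known entry, which is excluded from the score.
    Returns (total, per_pixel), where per_pixel divides by the number of missing entries.
    """
    # -[x log sigmoid(c) + (1 - x) log(1 - sigmoid(c))] simplifies to softplus(c) - x * c.
    terms = [softplus(ci) - xi * ci
             for xi, ci, known in zip(x, c, known_mask) if not known]
    total = sum(terms)
    return total, total / max(len(terms), 1)


total, per_pixel = imputation_nll(
    x=[1.0, 0.0, 1.0, 0.0], c=[0.0, 0.0, 5.0, -5.0], known_mask=[0, 0, 1, 1]
)
```

Working with the pre-sigmoid values $c$ and the softplus form avoids taking the log of a saturated sigmoid, which matters when guesses become confident.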

We tested imputation under two types of data masking: missing completely at random (MCAR) 
and missing at random (MAR). In MCAR, we masked pixels uniformly at random from the source 
images, and indicate removal of d% of the pixels by MCAR-d. In MAR, we masked square regions, 
with the occlusions located uniformly at random within the borders of the source image. We indicate 
occlusion of a $d \times d$ square by MAR-d.
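The two masking schemes can be sketched as follows. The function names and the mask convention (1 = known, 0 = missing) are hypothetical, and the MCAR sketch assumes that exactly d% of the pixels (rounded) are removed, rather than each pixel being dropped independently; the text above does not say which of these was used.

```python
import random


def mcar_mask(height, width, drop_percent, rng):
    """Drop drop_percent of the pixels uniformly at random (1 = known, 0 = missing)."""
    num_pixels = height * width
    num_missing = round(num_pixels * drop_percent / 100.0)
    missing = set(rng.sample(range(num_pixels), num_missing))
    return [[0 if r * width + c in missing else 1 for c in range(width)]
            for r in range(height)]


def mar_mask(height, width, d, rng):
    """Occlude one d x d square placed uniformly at random inside the image."""
    top = rng.randint(0, height - d)
    left = rng.randint(0, width - d)
    return [[0 if top <= r < top + d and left <= c < left + d else 1
             for c in range(width)]
            for r in range(height)]


rng = random.Random(0)
mcar_80 = mcar_mask(28, 28, 80, rng)   # MNIST-sized MCAR-80 mask
mar_14 = mar_mask(28, 28, 14, rng)     # MNIST-sized MAR-14 mask
```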

On MNIST, we tested MCAR-d for $d \in \{50, 60, 70, 80, 90\}$. MCAR-100 corresponds to unconditional
generation. On TFD and SVHN we tested MCAR-80. On MNIST, we tested MAR-d for
$d \in \{14, 16\}$. On TFD we tested MAR-25 and on SVHN we tested MAR-17. For test trials we
sampled masks from the same distribution used in training, and we sampled complete observations
from a held-out test set. Fig. 2 and Tab. 1 present quantitative results from these tests. Fig. 2(c)
shows the behavior of our GPSI models when we allowed them fewer/more refinement steps.

             MNIST               TFD                  SVHN
             MAR-14    MAR-16    MCAR-80    MAR-25    MCAR-80    MAR-17
LSTM-add     170       167       1381       1377      525        568
LSTM-jump    172       169       -          -         -          -
GPSI-add     177       175       1390       1380      531        569
GPSI-jump    183       177       1394       1384      540        572
VAE-imp      374       394       1416       1399      567        624


Table 1: Imputation performance in various settings. Details of the tests are provided in the main 
text. Lower scores are better. Due to time constraints, we did not test LSTM-jump on TFD or 
SVHN. These scores are normalized for the number of imputed pixels. 


We tested our models against three baselines. The baselines were “variational auto-encoder imputation”,
honest template matching, and oracular template matching. VAE imputation ran multiple
steps of VAE reconstruction, with the known values held fixed and the missing values re-estimated
with each reconstruction step.⁸ After 16 refinement steps, we scored the VAE based on its best
guesses. Honest template matching guessed the missing values based on the training image which
best matched the test image’s known values. Oracular template matching was like honest template
matching, but matched directly on the missing values.

⁷GPSI stands for “Guided Policy Search Imputer”. The tag “-add” refers to additive guess updates, and
“-jump” refers to updates that fully replace the guesses.

⁸We discuss some deficiencies of VAE imputation in the supplementary material.

Figure 3: This figure illustrates the policies learned by our models. (a): models trained for (MNIST,
MAR-16). From top→bottom the models are: GPSI-add, GPSI-jump, LSTM-add, LSTM-jump.
(b): models trained for (TFD, MAR-25), with models in the same order as (a), but without LSTM-jump.
(c): models trained for (SVHN, MAR-17), with models arranged as for (b).
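The two template-matching baselines described above can be sketched as a nearest-neighbour lookup. The squared-error distance, the function name, and the mask convention (1 = known) are assumptions made for this illustration; the text does not specify how the match was scored.

```python
def template_impute(x, known_mask, train_images, oracle=False):
    """Fill the missing entries of x from the best-matching training image.

    known_mask[i] == 1 marks a known entry. The honest variant matches on the
    known entries; the oracular variant matches on the (normally hidden) missing ones.
    """
    wanted = 0 if oracle else 1
    compare = [i for i, known in enumerate(known_mask) if known == wanted]
    best = min(train_images,
               key=lambda img: sum((img[i] - x[i]) ** 2 for i in compare))
    return [x[i] if known_mask[i] == 1 else best[i] for i in range(len(x))]


# Tiny hypothetical example: two known entries followed by two missing entries.
x = [0.0, 1.0, 1.0, 0.0]
known_mask = [1, 1, 0, 0]
train_images = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
honest = template_impute(x, known_mask, train_images)
oracular = template_impute(x, known_mask, train_images, oracle=True)
```

The oracular variant uses information that an imputation model never has, so it is best read as a reference point for how well any single training image could fill the gap.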


Our models significantly outperformed the baselines. In general, the LSTM-based models outperformed
the more direct GPSI models. We evaluated the log-likelihood of imputations produced by
our models using the lower bounds provided by the variational objectives with respect to which they 
were trained. Evaluating the template-based imputations was straightforward. For VAE imputation, 
we used the expected log-likelihood of the imputations sampled from multiple runs of the 16-step 
imputation process. This provides a valid, but loose, lower bound on their log-likelihood. 


As shown in Fig. 3, the imputations produced by our models appear promising. The imputations are
generally of high quality, and the models are capable of capturing strongly multi-modal reconstruction
distributions (see subfigure (a)). The behavior of GPSI models changed intriguingly when we
swapped the imputation constructor. Using the -jump imputation constructor, the imputation policy
learned by the direct model was rather inscrutable. Fig. 2(c) shows that additive guess updates
extracted more value from using more refinement steps. When trained on the binarized MNIST
benchmark discussed in Sec. 2.5, i.e. with binarized images and subject to MCAR-100, the LSTM-add
model produced raw/fine-tuned scores of 86.2/85.7. The LSTM-jump model scored 87.1/86.3.
Anecdotally, on this task, these “closed-loop” models seemed more prone to overfitting than the
“open-loop” models in Sec. 2.5. The supplementary material provides further qualitative results.


5 Discussion 


We presented a point of view which links methods for training directed generative models with 
policy search in reinforcement learning. We showed how our perspective can guide improvements 
to existing models. The importance of these connections will only grow as generative models rapidly 
increase in structural complexity and effective decision depth. 

We introduced the notion of imputation as a natural generalization of standard, unconditional generative
modelling. Depending on the relation between the data-to-generate and the available information,
imputation spans from full unconditional generative modelling to classification/regression.
We showed how to successfully train sequential imputation policies comprising millions of parameters
using an approach based on guided policy search [?]. Our approach outperforms the baselines
quantitatively and appears qualitatively promising. Incorporating, e.g., the local read/write mechanisms
from [4] should provide further improvements.
6 Appendix


7 Additional Material for Section 2


7.1 A Brief Review of Policy Search and Guided Policy Search 


Policy search refers to a general class of methods for searching directly through the space of possible 
parameterized policies for a reinforcement learning system (in contrast to fitting a value function and 
determining the policy implicitly by choosing the best actions). However, policy search is subject 
to local optima, which can be quite bad if the policy space is very rich (e.g., policies represented 
by deep networks). Guided policy search methods [?] address this problem by using
either guiding samples, or a guide policy (which generates guiding samples), in order to help move
the policy search away from bad local optima. We refer to “local optima” in a colloquial/practical
sense, i.e. regions of policy space in which the policy is unlikely to improve via noisy local search.

The initial approach to this problem was to generate guiding samples from policies obtained through
trajectory optimization using differential dynamic programming [?]. After applying importance
sampling corrections, the guiding samples were then used for off-policy training of the primary policy,
a standard approach in policy search. Further work has obtained samples by using a “guide
policy” which typically belongs to a larger policy class than the one being searched [?]. In both
cases, the optimization criterion contains, in addition to the reward, a regularization term requiring
trajectories from the trained policy to be close to the guide samples. Constraining divergence
between the guide samples and the trajectories produced by the trained policy allows the system 
generating the guide samples to gradually pull the trained policy towards improved behavior. 


7.2 A Path-wise KL Bound for Reversible Stochastic Processes 


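We now show that the objective in Eqn. (?) describes the KL divergence $\mathrm{KL}(q(\tau) \,\|\, p(\tau))$, and that it
provides, up to an additive constant, an upper bound on $\mathbb{E}_{x_T \sim \mathcal{D}_X}[-\log p(x_T)]$. First, for
$\tau = \{x_0, ..., x_T\}$, we define:


• $p(\tau_{>0}|x_0) = p(x_1, ..., x_T|x_0) = \prod_{t=1}^{T} p_t(x_t|x_{t-1})$


• $p(\tau) = p(x_1, ..., x_T|x_0)\, p_0(x_0) = p_0(x_0) \prod_{t=1}^{T} p_t(x_t|x_{t-1})$


• $q(\tau_{<T}|x_T) = q(x_0, ..., x_{T-1}|x_T) = \prod_{t=1}^{T} q_t(x_{t-1}|x_t)$


• $q(\tau) = q(x_0, ..., x_{T-1}|x_T)\, \mathcal{D}_X(x_T) = \mathcal{D}_X(x_T) \prod_{t=1}^{T} q_t(x_{t-1}|x_t)$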
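To make the four definitions above concrete, the following sketch evaluates $\log p(\tau)$ and $\log q(\tau)$ for a small two-state chain. The transition tables, the uniform $p_0$, the data distribution, and the use of time-homogeneous transitions ($p_t = p$, $q_t = q$) are all assumptions made for illustration; none of these numbers come from the paper.

```python
import math

# Hypothetical two-state example: every number below is an illustrative assumption.
P0 = [0.5, 0.5]                # p_0(x_0)
P = [[0.9, 0.1], [0.2, 0.8]]   # P[a][b] = p_t(x_t = b | x_{t-1} = a)
Q = [[0.7, 0.3], [0.4, 0.6]]   # Q[b][a] = q_t(x_{t-1} = a | x_t = b)
DX = [0.3, 0.7]                # D_X(x_T)


def log_p_tau(tau):
    """log p(tau) = log p_0(x_0) + sum_t log p_t(x_t | x_{t-1})."""
    total = math.log(P0[tau[0]])
    for prev, cur in zip(tau[:-1], tau[1:]):
        total += math.log(P[prev][cur])
    return total


def log_q_tau(tau):
    """log q(tau) = log D_X(x_T) + sum_t log q_t(x_{t-1} | x_t)."""
    total = math.log(DX[tau[-1]])
    for prev, cur in zip(tau[:-1], tau[1:]):
        total += math.log(Q[cur][prev])
    return total


tau = (0, 0, 1)  # a trajectory x_0, x_1, x_2 with T = 2
print(log_p_tau(tau), log_q_tau(tau))
```

Both functions score the same trajectory; they differ only in the direction in which the chain is factorized, which is what lets the two path distributions be compared via KL.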
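Next, we derive:

\begin{align}
p(x_T) &= \sum_{x_0,...,x_{T-1}} p_0(x_0)\, p(\tau_{>0}|x_0) \tag{18} \\
&= \sum_{x_0,...,x_{T-1}} p_0(x_0)\, p(\tau_{>0}|x_0)\, \frac{q(\tau_{<T}|x_T)}{q(\tau_{<T}|x_T)} \tag{19} \\
&= \sum_{x_0,...,x_{T-1}} q(\tau_{<T}|x_T)\, \frac{p_0(x_0)\, p(\tau_{>0}|x_0)}{q(\tau_{<T}|x_T)} \tag{20} \\
&= \sum_{x_0,...,x_{T-1}} q(x_0,...,x_{T-1}|x_T)\, p_0(x_0) \prod_{t=1}^{T} \frac{p_t(x_t|x_{t-1})}{q_t(x_{t-1}|x_t)} \tag{21} \\
\log p(x_T) &\geq \sum_{x_0,...,x_{T-1}} q(x_0,...,x_{T-1}|x_T)\, \log \left( p_0(x_0) \prod_{t=1}^{T} \frac{p_t(x_t|x_{t-1})}{q_t(x_{t-1}|x_t)} \right) \tag{22} \\
&\geq \mathbb{E}_{q(\tau_{<T}|x_T)} \left[ \log p_0(x_0) - \log \prod_{t=1}^{T} \frac{q_t(x_{t-1}|x_t)}{p_t(x_t|x_{t-1})} \right] \tag{23} \\
&\geq \mathbb{E}_{q(\tau_{<T}|x_T)} \left[ \log p_0(x_0) - \log \frac{q(\tau_{<T}|x_T)}{p(\tau_{>0}|x_0)} \right] \tag{24} \\
&\geq \mathbb{E}_{q(\tau_{<T}|x_T)} \left[ \log p_0(x_0) \right] - \mathrm{KL}(q(\tau_{<T}|x_T) \,\|\, p(\tau_{>0}|x_0)) \tag{25}
\end{align}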


which provides a lower bound on $\log p(x_T)$ based on sample trajectories produced by the reverse-time
process $q$ when it is started at $x_T$. The transition from equality to inequality is due to Jensen’s
inequality. Though $q(\tau_{<T}|x_T)$ and $p(\tau_{>0}|x_0)$ may at first seem incommensurable via KL, they
both represent distributions over $T$-step trajectories through $\mathcal{X}$ space, and thus the required KL
divergence is well-defined. Next, by adding an expectation with respect to $x_T \sim \mathcal{D}_X$, we derive a
lower bound on the expected log-likelihood $\mathbb{E}_{x_T \sim \mathcal{D}_X}[\log p(x_T)]$:


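\begin{align}
\log p(x_T) &\geq \mathbb{E}_{q(\tau_{<T}|x_T)} \left[ \log p_0(x_0) - \log \frac{q(\tau_{<T}|x_T)}{p(\tau_{>0}|x_0)} \right] \tag{26} \\
\mathbb{E}_{x_T \sim \mathcal{D}_X} \left[ \log p(x_T) \right] &\geq \mathbb{E}_{x_T \sim \mathcal{D}_X} \left[ \mathbb{E}_{q(\tau_{<T}|x_T)} \left[ \log p_0(x_0) - \log \frac{q(\tau_{<T}|x_T)}{p(\tau_{>0}|x_0)} \right] \right] \tag{27} \\
&\geq \mathbb{E}_{q(\tau)} \left[ \log p_0(x_0) - \log \frac{q(\tau_{<T}|x_T)}{p(\tau_{>0}|x_0)} \right] \tag{28} \\
&\geq \mathbb{E}_{q(\tau)} \left[ -\log \frac{\mathcal{D}_X(x_T)\, q(\tau_{<T}|x_T)}{p_0(x_0)\, p(\tau_{>0}|x_0)} \right] - H_{\mathcal{D}_X} \tag{29} \\
&\geq -\mathrm{KL}(q(\tau) \,\|\, p(\tau)) - H_{\mathcal{D}_X} \tag{30}
\end{align}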

These steps follow directly from the definitions of $q(\tau_{<T}|x_T)$ and $q(\tau)$. In the last two equations, we
define $H_{\mathcal{D}_X} \triangleq \mathbb{E}_{x \sim \mathcal{D}_X}[-\log \mathcal{D}_X(x)]$, which gives the entropy of $\mathcal{D}_X$. Thus, when $\mathcal{D}_X$ is constant
with respect to the trainable parameters, the training objective in [?] is equivalent to minimizing
the path-based $\mathrm{KL}(q(\tau) \,\|\, p(\tau))$.
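The bound in Eq. (30) can be checked numerically on a chain small enough to enumerate every trajectory. The sketch below does this for a two-state chain with $T = 3$; the transition tables, $p_0$, and the data distribution are made-up values, and time-homogeneous transitions are assumed purely to keep the example short.

```python
import itertools
import math

# Illustrative assumptions: two states, time-homogeneous transitions, T = 3.
T = 3
P0 = [0.5, 0.5]                # p_0(x_0)
P = [[0.9, 0.1], [0.2, 0.8]]   # P[a][b] = p_t(x_t = b | x_{t-1} = a)
Q = [[0.7, 0.3], [0.4, 0.6]]   # Q[b][a] = q_t(x_{t-1} = a | x_t = b)
DX = [0.3, 0.7]                # D_X(x_T)


def p_tau(tau):
    """Forward path probability p(tau)."""
    prob = P0[tau[0]]
    for prev, cur in zip(tau[:-1], tau[1:]):
        prob *= P[prev][cur]
    return prob


def q_tau(tau):
    """Reverse-time path probability q(tau), started from x_T ~ D_X."""
    prob = DX[tau[-1]]
    for prev, cur in zip(tau[:-1], tau[1:]):
        prob *= Q[cur][prev]
    return prob


trajectories = list(itertools.product(range(2), repeat=T + 1))

# Path-wise KL(q(tau) || p(tau)) by exhaustive enumeration.
path_kl = sum(q_tau(t) * math.log(q_tau(t) / p_tau(t)) for t in trajectories)

# Entropy of the data distribution, H = E_{x ~ D_X}[-log D_X(x)].
entropy = -sum(d * math.log(d) for d in DX)

# Exact marginal p(x_T) and the expected log-likelihood under D_X.
p_xT = [sum(p_tau(t) for t in trajectories if t[-1] == x) for x in range(2)]
expected_ll = sum(DX[x] * math.log(p_xT[x]) for x in range(2))

lower_bound = -path_kl - entropy
print(expected_ll, lower_bound)
```

Because the entropy term does not depend on the parameters of $p$ or $q$, tightening this bound amounts to reducing the path-wise KL, which is the point made in the paragraph above.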


8 Additional Material for Section 3

The LSTM-based generative model from Section 2.4



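Algorithm 1 GenTrainLoop1($x^*$)

 1: Set $s_0^q$, $s_0^p$, and $c_0$ from constants.
 2: Compute $\mathrm{nll}_0 = -\log p(x^*|c_0)$.
 3: Set $\mathrm{kl}_0$ to 0.
 4: for $t = 1$ to $T$ do
 5:     Update $s_t^q \leftarrow f_\phi(s_{t-1}^q, g_\phi(s_{t-1}^p, c_{t-1}, x^*))$.
 6:     Sample $z_t \sim q_\phi(z_t|s_t^q)$.
 7:     Update $s_t^p \leftarrow f_\theta(s_{t-1}^p, z_t)$.
 8:     Update $c_t \leftarrow c_{t-1} + \omega_\theta(s_t^p)$ (or $\omega_\theta(s_t^p)$).
 9:     Compute $\mathrm{kl}_t = \mathrm{KL}(q_\phi(z_t|s_t^q) \,\|\, p_\theta(z_t))$.
10:     Compute $\mathrm{nll}_t = -\log p(x^*|c_t)$.
11: end for
12: return $c_{0:T}$, $\mathrm{nll}_{0:T}$, and $\mathrm{kl}_{0:T}$.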


Figure 4: Left: this figure illustrates the structure of the LSTM-based model from [?], as described
in Sec. 2.4. Single-edged nodes are deterministic and double-edged nodes are stochastic. Dashed
nodes and edges are present only during training. Right: this figure provides pseudo-code for the
loop that computes all values required for computing this model’s training objective. The objective
follows the form of Eqn. (?). To simplify notation, we don’t distinguish between the visible/hidden
states of the LSTMs.
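The control flow of GenTrainLoop1 can be sketched in a few lines of plain Python. This is a toy stand-in under stated assumptions, not the model from the paper: the two LSTMs are replaced by scalar tanh recurrences, $g$ is taken to be the mean residual between $x^*$ and the current reconstruction, $q$ is a Gaussian with fixed standard deviation, the prior is a standard normal, the likelihood is Bernoulli, and all weights (0.5, 0.3) are arbitrary.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def bernoulli_nll(x_star, c):
    """-log p(x* | c) for independent Bernoulli outputs with logits c."""
    total = 0.0
    for x, ci in zip(x_star, c):
        p = sigmoid(ci)
        total -= x * math.log(p) + (1.0 - x) * math.log(1.0 - p)
    return total


def gen_train_loop1(x_star, T=4, sigma=0.5, seed=0):
    """Toy version of the loop: returns c_{0:T}, nll_{0:T}, kl_{0:T}."""
    rng = random.Random(seed)
    n = len(x_star)
    s_q, s_p, c = 0.0, 0.0, [0.0] * n                 # step 1: constants
    cs = [list(c)]
    nll = [bernoulli_nll(x_star, c)]                  # step 2
    kl = [0.0]                                        # step 3
    for _ in range(T):
        # Step 5: the guide state reads x* through a residual (assumed form of g).
        read = sum(x - sigmoid(ci) for x, ci in zip(x_star, c)) / n
        s_q = math.tanh(0.5 * s_q + 0.5 * s_p + read)
        # Step 6: sample z_t ~ q(z_t | s_q) = N(s_q, sigma^2).
        z = s_q + sigma * rng.gauss(0.0, 1.0)
        # Step 7: primary state update.
        s_p = math.tanh(0.5 * s_p + z)
        # Step 8: additive canvas update with omega(s) = 0.3 * s on every output.
        c = [ci + 0.3 * s_p for ci in c]
        # Step 9: closed-form KL(N(s_q, sigma^2) || N(0, 1)).
        kl.append(0.5 * (s_q ** 2 + sigma ** 2 - 1.0) - math.log(sigma))
        # Step 10: reconstruction cost after the update.
        nll.append(bernoulli_nll(x_star, c))
        cs.append(list(c))
    return cs, nll, kl


cs, nll, kl = gen_train_loop1([1.0, 0.0, 1.0])
print(nll[-1] + sum(kl))  # one possible per-example training cost
```

Only the guide state `s_q` ever sees `x_star`; the primary state `s_p` is driven by the sampled `z` alone, which mirrors the split between dashed and solid edges described in the caption.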


The extended LSTM-based generative model from Section 2.5



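Algorithm 1 GenTrainLoop2($x^*$)

 1: Sample $z_0 \sim q_\phi(z_0|x^*)$.
 2: Set $s_0^q$ and $s_0^p$ from $f_\theta(z_0)$.
 3: Set $c_0$ from a constant.
 4: Compute $\mathrm{kl}_0 = \mathrm{KL}(q_\phi(z_0|x^*) \,\|\, p_\theta(z_0))$.
 5: Compute $\mathrm{nll}_0 = -\log p(x^*|c_0)$.
 6: for $t = 1$ to $T$ do
 7:     Update $s_t^q \leftarrow f_\phi(s_{t-1}^q, g_\phi(s_{t-1}^p, c_{t-1}, x^*))$.
 8:     Sample $z_t \sim q_\phi(z_t|s_t^q)$.
 9:     Update $s_t^p \leftarrow f_\theta(s_{t-1}^p, z_t)$.
10:     Update $c_t \leftarrow c_{t-1} + \omega_\theta(s_t^p)$ (or $\omega_\theta(s_t^p)$).
11:     Compute $\mathrm{kl}_t = \mathrm{KL}(q_\phi(z_t|s_t^q) \,\|\, p_\theta(z_t|s_{t-1}^p))$.
12:     Compute $\mathrm{nll}_t = -\log p(x^*|c_t)$.
13: end for
14: return $c_{0:T}$, $\mathrm{nll}_{0:T}$, and $\mathrm{kl}_{0:T}$.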


Figure 5: Left: this figure illustrates the structure of the extended LSTM-based model described
in Sec. 2.5. Single-edged nodes are deterministic and double-edged nodes are stochastic. Dashed
nodes and edges are present only during training. Right: this figure provides pseudo-code for the
loop that computes all values required for computing this model’s training objective. The objective
follows the form of Eqn. (?). To simplify notation, we don’t distinguish between the visible/hidden
states of the LSTMs.
The direct imputation model from Section 3.1



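Algorithm 1 ImpTrainLoop1($x$, $m$)

 1: Set $x^k, x^u \leftarrow \mathrm{ApplyMask}(x, m)$.
 2: Set $c_0$ from a constant.
 3: Set $\mathrm{nll}_0$ and $\mathrm{kl}_0$ to 0.
 4: for $t = 1$ to $T$ do
 5:     Sample $z_t \sim q_\phi(z_t|c_{t-1}, x^k, x^u)$.
 6:     Update $c_t \leftarrow c_{t-1} + \omega_\theta(z_t)$ (or $\omega_\theta(z_t)$).
 7:     Compute $\mathrm{kl}_t = \mathrm{KL}(q_\phi(z_t|c_{t-1}, x^k, x^u) \,\|\, p_\theta(z_t|c_{t-1}, x^k))$.
 8:     Compute $\mathrm{nll}_t = -\log p(x^u|c_t)$.
 9: end for
10: return $c_{0:T}$, $\mathrm{nll}_{0:T}$, and $\mathrm{kl}_{0:T}$.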


Figure 6: Left: this figure illustrates the structure of the “direct” imputation model described in
Sec. 3.1. Single-edged nodes are deterministic and double-edged nodes are stochastic. All solid
lines affect the primary and guide policies. All dashed lines affect only the guide policy. Right:
this figure provides pseudo-code for the loop that computes all values required for computing this
model’s training objective. The objective follows the form of Eqn. (?). To simplify notation, we
don’t distinguish between the visible/hidden states of the LSTMs.
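The following sketch mirrors the structure of ImpTrainLoop1 with toy components. Everything model-specific is an assumption made for illustration: `apply_mask` zero-fills the complementary entries (with `m[i] == 1` marking a known value), the primary and guide policies are scalar Gaussians with a shared fixed standard deviation whose means are simple residuals, $\omega(z) = z$, and the likelihood is Bernoulli over the missing entries only.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def apply_mask(x, m):
    """Split x into known (m[i] == 1) and unknown (m[i] == 0) parts, zero-filling the rest."""
    x_known = [xi * mi for xi, mi in zip(x, m)]
    x_unknown = [xi * (1 - mi) for xi, mi in zip(x, m)]
    return x_known, x_unknown


def nll_unknown(x, m, c):
    """-log p(x^u | c): Bernoulli NLL summed over the missing positions only."""
    total = 0.0
    for xi, mi, ci in zip(x, m, c):
        if mi == 0:
            p = sigmoid(ci)
            total -= xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
    return total


def imp_train_loop1(x, m, T=4, sigma=0.5, seed=0):
    """Toy version of the loop: returns c_{0:T}, nll_{0:T}, kl_{0:T}."""
    rng = random.Random(seed)
    x_k, x_u = apply_mask(x, m)                       # step 1
    n = len(x)
    c = [0.0] * n                                     # step 2
    cs, nll, kl = [list(c)], [0.0], [0.0]             # step 3
    for _ in range(T):
        # Primary policy mean: depends only on c_{t-1} and the known values.
        mu_p = (sum(x_k) - sum(sigmoid(ci) * mi for ci, mi in zip(c, m))) / n
        # Guide policy mean: additionally sees the residual on the missing values.
        mu_q = mu_p + (sum(x_u) - sum(sigmoid(ci) * (1 - mi) for ci, mi in zip(c, m))) / n
        z = mu_q + sigma * rng.gauss(0.0, 1.0)        # step 5: sample from the guide
        c = [ci + z for ci in c]                      # step 6: additive update
        # Step 7: KL between two Gaussians with equal variance.
        kl.append((mu_q - mu_p) ** 2 / (2.0 * sigma ** 2))
        nll.append(nll_unknown(x, m, c))              # step 8
        cs.append(list(c))
    return cs, nll, kl


cs, nll, kl = imp_train_loop1([1, 0, 1, 1], [1, 1, 0, 0])
print(nll[-1], sum(kl))
```

The KL term is what ties the two policies together: it is zero exactly when the primary policy, which never sees the missing values, already proposes the same step as the guide.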


The LSTM-based imputation model from Section 3.2



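Algorithm 1 ImpTrainLoop2($x$, $m$)

 1: Set $x^k, x^u \leftarrow \mathrm{ApplyMask}(x, m)$.
 2: Set $c_0$ from a constant.
 3: Sample $z_0 \sim q_\phi(z_0|c_0, x^k, x^u)$.
 4: Set $s_0^r$, $s_0^q$, $s_0^p$ from $f_\theta(z_0)$.
 5: Compute $\mathrm{kl}_0 = \mathrm{KL}(q_\phi(z_0|c_0, x^k, x^u) \,\|\, p_\theta(z_0|c_0, x^k))$.
 6: Compute $\mathrm{nll}_0 = -\log p(x^u|c_0)$.
 7: for $t = 1$ to $T$ do
 8:     Update $s_t^q \leftarrow f_\phi(s_{t-1}^q, g_\phi(s_{t-1}^p, c_{t-1}, x^k, x^u))$.
 9:     Update $s_t^r \leftarrow f_\theta(s_{t-1}^r, g_\theta(s_{t-1}^p, c_{t-1}, x^k))$.
10:     Sample $z_t \sim q_\phi(z_t|s_t^q)$.
11:     Update $s_t^p \leftarrow f_\theta(s_{t-1}^p, z_t)$.
12:     Update $c_t \leftarrow c_{t-1} + \omega_\theta(s_t^p)$ (or $\omega_\theta(s_t^p)$).
13:     Compute $\mathrm{kl}_t = \mathrm{KL}(q_\phi(z_t|s_t^q) \,\|\, p_\theta(z_t|s_t^r))$.
14:     Compute $\mathrm{nll}_t = -\log p(x^u|c_t)$.
15: end for
16: return $c_{0:T}$, $\mathrm{nll}_{0:T}$, and $\mathrm{kl}_{0:T}$.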


Figure 7: Left: this figure illustrates the structure of the “LSTM” imputation model described in
Sec. 3.2. Single-edged nodes are deterministic and double-edged nodes are stochastic. All solid
lines affect the primary and guide policies. All dashed lines affect only the guide policy. Right:
this figure provides pseudo-code for the loop that computes all values required for computing this
model’s training objective. The objective follows the form of Eqn. (?). To simplify notation, we
don’t distinguish between the visible/hidden states of the LSTMs.


9 Additional Material for Experiments and Model Implementations 

9.1 Model Implementation Details 

For purely generative tests, all LSTMs had hidden and visible states in $\mathbb{R}^{?}$. We ran the LSTMs for
16 steps. For our extended model in Sec. 2.5, the variational distribution over $z_0$ was computed using
a feedforward network with a single hidden layer of 250 tanh units. Samples of $z_0$ were converted
into initial hidden/visible states for the primary and guide LSTMs using a feedforward network with
a single hidden layer of 250 tanh units. The latent variable $z_0$ was in $\mathbb{R}^{?}$ and the latent variables $z_t$
for $t > 0$ were in $\mathbb{R}^{?}$.

We trained the models using minibatches of size 250. For each example in the minibatch we sampled
a single trajectory from the guide policy. The necessary KL divergences were computed via partial
Rao-Blackwellisation, i.e. at each step we computed a 1-step KL analytically, and the sum of these
provided an estimator whose mean was the full-trajectory KL.
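As a minimal sketch of this estimator, the code below sums closed-form one-step KL terms along a single sampled trajectory. It assumes scalar Gaussian step distributions for both the guide and the primary policy; the Gaussian form and the names `gaussian_kl` and `trajectory_kl_estimate` are illustrative assumptions rather than details taken from the paper.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def trajectory_kl_estimate(step_params):
    """Sum the analytic one-step KLs along one trajectory sampled from the guide.

    step_params holds one (mu_q, sigma_q, mu_p, sigma_p) tuple per step, each
    evaluated at the state the sampled trajectory actually visited.
    """
    return sum(gaussian_kl(*params) for params in step_params)


# Two steps of a hypothetical trajectory.
print(trajectory_kl_estimate([(1.0, 1.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0)]))
```

Only the states are sampled; the inner expectation over each $z_t$ is computed exactly, which typically gives lower variance than scoring the sampled $z_t$ values under both policies.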

In the generative tests, we trained the “raw” model for 200k updates. The variational posterior
fine-tuning stage lasted 50k updates. We used the ADAM algorithm for optimization [?], which includes
both first-moment (momentum-like) smoothing and second-moment (Adagrad-like) rescaling. We used a
learning rate of 0.0002 for all models in all tests.
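For reference, a single ADAM update on a scalar parameter can be sketched as follows. The learning rate 0.0002 is the value stated above; the decay rates `beta1`, `beta2` and the constant `eps` are the defaults commonly used with ADAM and are an assumption here, since the text does not say which values were used.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the 1-based step count. Returns (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment (momentum-like) smoothing
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment (Adagrad-like) rescaling
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for the zero initialisation
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=2.0, m=m, v=v, t=1)
print(param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's scale.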

The imputation tests added a “reader” LSTM to the generative model (i.e. the primary policy). This 
had precisely the same structure as the guide LSTM. However, rather than inputting [ct; Ct] at each 
step (which includes information about the target values in $x^*$), we simply input [c*; Cj]. This was
the first thing we tried, and it worked alright, but could probably be improved. 

We used the rather new Blocks framework for managing all of our LSTM-based models, though we 
only really used the framework for managing the THEANO computation graph [?, ?]. All training
and data management were done manually in our test scripts. In addition to the LSTM-based models, 
we also implemented the GPSI models and baselines using THEANO. 

We trained our GPSI models using the same basic setup as for the LSTM models. For MNIST tests,
the three networks underlying the model were built using two hidden layers of 1000 ReLU units.
For the TFD and SVHN tests the layers were increased to 1500 units. We used latent variables
$z_t \in \mathbb{R}^{?}$ for MNIST and $z_t \in \mathbb{R}^{?}$ for TFD/SVHN. Batch sizes and optimization method were
the same as for the LSTMs. Code is available on GitHub. Due to computation/time constraints we
performed little/no hyperparameter search. The GPSI results should improve somewhat with better
architecture choices. Adding the localized read/write mechanisms from [?] may help too.

9.2 Problems with VAE Imputation 

Variational autoencoder imputation proceeds by running multiple steps of iterative sampling from
the approximate posterior $q(z|x)$ and then from the reconstruction distribution $p(x|z)$, with the
known values replaced by their true values at each step. That is, the missing values are repeatedly
guessed based on the previous guessed values, combined with the true known values.
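The procedure just described can be sketched as a short loop. The samplers `toy_sample_q` and `toy_sample_p` below are hypothetical stand-ins for a trained encoder and decoder, and initialising the missing entries with uniform noise is one of several possible choices; only the alternate-and-clamp structure is taken from the text.

```python
import random


def vae_impute(x, m, sample_q, sample_p, steps=10, seed=0):
    """Alternate z ~ q(z|x) and x ~ p(x|z), clamping known values after every step.

    m[i] == 1 marks a known entry of x; sample_q and sample_p are caller-supplied.
    """
    rng = random.Random(seed)
    # Fill the missing entries with noise to start (an assumed initialisation).
    current = [xi if mi == 1 else rng.random() for xi, mi in zip(x, m)]
    for _ in range(steps):
        z = sample_q(current, rng)          # guess a latent code from the current image
        proposal = sample_p(z, rng)         # reconstruct every entry from that code
        # Keep the reconstruction only where the true value is unknown.
        current = [xi if mi == 1 else pi for xi, mi, pi in zip(x, m, proposal)]
    return current


# Toy stand-ins for a trained model (assumptions for illustration only).
def toy_sample_q(x, rng):
    return sum(x) / len(x) + 0.1 * rng.gauss(0.0, 1.0)


def toy_sample_p(z, rng, n=4):
    return [min(1.0, max(0.0, z + 0.1 * rng.gauss(0.0, 1.0))) for _ in range(n)]


x = [1.0, 0.0, 0.5, 0.5]
m = [1, 1, 0, 0]
print(vae_impute(x, m, toy_sample_q, toy_sample_p))
```

The known entries pass through unchanged on every iteration, so any influence they have on the imputed values must flow through the latent code `z`.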

Consider an extreme case in which the mutual information between $z$ and $x$ in the joint distribution
$p(x, z) = p(x|z)p(z)$, arising from combining $p(x|z)$ with the latent prior $p(z)$, is 0. In this case,
even if the marginal over $x$, i.e. $p(x) = \sum_z p(x|z)p(z)$, is equal to the target distribution $\mathcal{D}_X$, each
sample of new guesses for the missing values will be sampled independently from the marginal over
those values in $\mathcal{D}_X$. Thus, the new guesses will be informed by neither the previous guesses nor the
known part of the observation for which imputation is being performed.

In addition to this fundamental defect, the VAE approach to imputation also suffers due to the
posterior inference model $q(z|x)$ lacking any prior experience with heavily perturbed observations. That is,
if all training is performed on unperturbed observations, then the response of $q(z|x)$ cannot be
guaranteed to remain useful when presented with observations from a different, perturbed distribution.

While one could train a basic VAE for imputation by sampling random “VAE imputation” trajectories
and then backpropagating the imputation log-likelihood through those trajectories, we
empirically found that this was largely ineffective. In a strong sense, the problem with this approach
is analogous to that solved (in certain situations) by guided policy search. That is, the primary policy
is initially so poor that training it with, e.g., a policy gradient approach will be uninformative and
ineffective. By incorporating privileged information in the guide policy, one can slowly shepherd
the initially poor primary policy towards gradually improving behavior.

9.3 Additional Qualitative Results for GPSI Models 
Figure 8: This figure illustrates roll-outs of (a) additive, (b) jump, and (c) variational auto-encoder
policies trained on MNIST as described in the main text. The ways in which the additive and
jump policies proceed towards their final imputations are visually distinct. We ran two independent
roll-outs of each policy type for each initial state, to exhibit the ability of our models to produce
multimodal imputation densities. All initial states were generated by randomly occluding a 16x16
block of pixels in images taken from the validation set, i.e. these initial conditions were never
experienced during training. Zoom in for best viewing.
Figure 9: This figure illustrates roll-outs of (a) additive, (b) jump, and (c) variational auto-encoder
policies trained on (grayscale) SVHN as described in the main text. The ways in which the additive
and jump policies proceed towards their final imputations are visually distinct. We ran two
independent roll-outs of each policy type for each initial state, to exhibit the ability of our models to
produce multimodal imputation densities. All initial states were generated by randomly occluding
a 17x17 block of pixels in images taken from the validation set, i.e. these initial conditions were
never experienced during training. Zoom in for best viewing.
Figure 10: This figure illustrates roll-outs of (a) additive, (b) jump, and (c) variational auto-encoder
policies trained on TFD as described in the main text. The ways in which the additive and jump
policies proceed towards their final imputations are visually distinct. In particular, the “strategy”
pursued by the jump policy is not intuitively clear. We ran two independent roll-outs of each policy
type for each initial state, to exhibit the ability of our models to produce multimodal imputation
densities. All initial states were generated by randomly occluding a 25x25 block of pixels in images
taken from the validation set, i.e. these initial conditions were never experienced during training.
Zoom in for best viewing.